R - An Introduction

MSDA - Bootcamp 2025 Summer

KT Wong

Faculty of Social Sciences, HKU

2025-07-30

The materials in this topic are drawn from Wickham and Grolemund (2023), Wickham (2019) and Wickham (2016) as well as other sources, including Princeton Sociology Methods Camp 2023. The materials are for educational purposes only.

Introduction

  • R is a programming language and software environment for statistical computing and graphics
    • widely used in academia and industry
    • for data analysis, statistical modeling, and visualization
    • for machine learning, data mining, and big data analysis
    • for reproducible research and scientific computing
  • R is an implementation of the S programming language
    • S was developed at Bell Laboratories by John Chambers and colleagues
    • S was designed for data analysis and graphics
    • S was the precursor to the commercial statistical software package S-PLUS

Introduction

  • R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand
    • named partly after the first names of the first two R authors and partly as a play on the name of S
    • part of the GNU Project, the free software initiative launched by Richard Stallman in 1983
    • freely available under the GNU General Public License
    • a large number of packages available

Introduction

  • tidyverse is a collection of R packages designed for data science
    • tidyverse packages share an underlying design philosophy, grammar, and data structures
    • tidyverse packages are designed to work together
  • focus on the tidyverse
    • mainly because it is easier to understand
    • do most data manipulation we need in social science research with these tools
  • A good reference for tidyverse is the book by Wickham (2019)

Introduction

  • for quick reference, visit Posit Primers on Data Science
    • R Basics
    • Transform Tables
    • Visualize Data
  • for more advanced topics, visit Data Science
    • Data Wrangling
    • Data Visualization
    • Modelling
  • we cover some base R functions first, then move on to the tidyverse

Loading Data

  • The first step in any data analysis is to load the data into R
    • the data can be in a variety of formats
    • the most common formats are CSV, Excel, and SPSS
    • the readr package is part of the tidyverse and is used to read data into R
Code
#install.packages("tidyverse")
library(tidyverse)


# check your working directory
getwd()

## read data file

addh<- read_csv("https://raw.githubusercontent.com/kwan-MSDA/Bootcamp_2024/main/dataset/addhealthfake.csv")

Add health dataset

  • Here are some details about the dataset
    • The National Longitudinal Study of Adolescent to Adult Health (Add Health)
    • a longitudinal study of a nationally representative sample of adolescents in grades 7-12 in the United States during the 1994-95 school year (Wave I)
    • The Add Health cohort has been followed into young adulthood with four in-home interviews (Waves I-IV) completed by 2008
      • Wave V, conducted during 2016-2018, used a mixed-mode survey
    • the data cover respondents’ social, economic, psychological, and physical well-being
      • along with contextual data on the family, neighborhood, community, school, friendships, peer groups, and romantic relationships
    • the data are used to study developmental trajectories of health and risk behaviors throughout the life course

Explore data

  • The dataset used here is a subset of the Add Health dataset

    • 3000 observations and 11 variables
  • After the dataset is loaded in R, it is important to explore the data to understand its structure and content

    • check the data types of each variable
    • check the dimensions of the data
    • look at a few rows and variables
Code
class(addh$age)
[1] "numeric"
Code
class(addh$gender)
[1] "character"
Code
class(addh$love)
[1] "numeric"

Explore data

  • For more information on the dataset, use:
    • summary(): numeric summaries
    • str(): data types and sample data
    • colnames() or names(): names of columns/variables
    • dim(): dimensions
    • View(): view all data in RStudio viewer
    • head(): first 6 rows by default (set n to change this)
    • tail(): last 6 rows by default (set n to change this)
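
The exploration functions listed above can be tried on a small made-up data frame before applying them to addh. This is only a sketch: toy and its values are hypothetical and are not part of the Add Health data. By default head() and tail() return six rows, and the n argument changes that.

```r
# Hypothetical toy data frame with 8 rows (not the Add Health data)
toy <- data.frame(id = 1:8,
                  age = c(18, 22, 18, 26, 27, 21, 19, 27),
                  gender = rep(c("female", "male"), times = 4))

dim(toy)            # number of rows, then number of columns
names(toy)          # column names
nrow(head(toy))     # head() returns the first 6 rows by default
nrow(tail(toy, 3))  # n = 3 asks for the last 3 rows instead
summary(toy$age)    # numeric summary of one column
```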

Explore data

Code
str(addh)
spc_tbl_ [3,000 × 11] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ id           : num [1:3000] 1 2 3 4 5 6 7 8 9 10 ...
 $ age          : num [1:3000] 18 22 18 26 27 21 19 27 18 25 ...
 $ gender       : chr [1:3000] "female" "male" "female" "female" ...
 $ income       : num [1:3000] 19252 11617 16189 18194 24484 ...
 $ logincome    : num [1:3000] 9.87 9.36 9.69 9.81 10.11 ...
 $ debt         : chr [1:3000] "yesdebt" "nodebt" "yesdebt" "yesdebt" ...
 $ love         : num [1:3000] 1 10 10 2 5 10 3 4 1 6 ...
 $ nocheating   : num [1:3000] 7 10 3 1 10 4 10 10 10 3 ...
 $ money        : num [1:3000] 9 3 5 3 9 9 9 7 3 8 ...
 $ paypercent   : num [1:3000] 46 56 42 82 93 42 89 55 43 53 ...
 $ logpaypercent: num [1:3000] 3.83 4.03 3.74 4.41 4.53 ...
 - attr(*, "spec")=
  .. cols(
  ..   id = col_double(),
  ..   age = col_double(),
  ..   gender = col_character(),
  ..   income = col_double(),
  ..   logincome = col_double(),
  ..   debt = col_character(),
  ..   love = col_double(),
  ..   nocheating = col_double(),
  ..   money = col_double(),
  ..   paypercent = col_double(),
  ..   logpaypercent = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 
Code
head(addh, n=5)
# A tibble: 5 × 11
     id   age gender income logincome debt     love nocheating money paypercent
  <dbl> <dbl> <chr>   <dbl>     <dbl> <chr>   <dbl>      <dbl> <dbl>      <dbl>
1     1    18 female 19252.      9.87 yesdebt     1          7     9         46
2     2    22 male   11617.      9.36 nodebt     10         10     3         56
3     3    18 female 16189.      9.69 yesdebt    10          3     5         42
4     4    26 female 18194.      9.81 yesdebt     2          1     3         82
5     5    27 female 24484.     10.1  yesdebt     5         10     9         93
# ℹ 1 more variable: logpaypercent <dbl>

Explore data

  • To get information about one variable, use the following functions:
    • table(): get a table summarizing counts
    • unique(): get the unique responses for a variable
    • sort(): sort the values numerically (or alphabetically)
    • hist(): produce a histogram
Code
table(addh$gender)

female   male 
  1503   1497 
Code
table(addh$age)

 18  19  20  21  22  23  24  25  26  27 
306 299 300 315 303 265 301 278 296 337 
Code
sort(unique(addh$age))
 [1] 18 19 20 21 22 23 24 25 26 27

Subset data

  • use the base R subsetting syntax with [row index, column index]
Code
# get first column, rows 1 through 3 
addh[1:3,1]
# A tibble: 3 × 1
     id
  <dbl>
1     1
2     2
3     3
  • Exercise: How would you subset the observation in the third row and the fifth column?
Code
# get everything besides first row
addh[-1, ]
# A tibble: 2,999 × 11
      id   age gender income logincome debt     love nocheating money paypercent
   <dbl> <dbl> <chr>   <dbl>     <dbl> <chr>   <dbl>      <dbl> <dbl>      <dbl>
 1     2    22 male   11617.      9.36 nodebt     10         10     3         56
 2     3    18 female 16189.      9.69 yesdebt    10          3     5         42
 3     4    26 female 18194.      9.81 yesdebt     2          1     3         82
 4     5    27 female 24484.     10.1  yesdebt     5         10     9         93
 5     6    21 female 22353.     10.0  nodebt     10          4     9         42
 6     7    19 male   11842.      9.38 yesdebt     3         10     9         89
 7     8    27 female 19874.      9.90 nodebt      4         10     7         55
 8     9    18 male   27422.     10.2  nodebt      1         10     3         43
 9    10    25 female  9968.      9.21 yesdebt     6          3     8         53
10    11    24 female 26354.     10.2  nodebt     10         10    10         52
# ℹ 2,989 more rows
# ℹ 1 more variable: logpaypercent <dbl>
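
The [row index, column index] syntax shown above also accepts names, logical conditions, and negative indices. The sketch below uses a hypothetical base data frame called toy. One difference to keep in mind: a base data.frame returns a plain value for a single cell, whereas a tibble such as addh returns a 1 × 1 tibble.

```r
# Hypothetical toy data frame (values made up for illustration)
toy <- data.frame(id = 1:4,
                  age = c(18, 22, 18, 26),
                  income = c(19252, 11617, 16189, 18194))

toy[2, 3]                  # single cell: row 2, column 3
toy[1:2, c("id", "age")]   # rows 1 to 2, columns chosen by name
toy[toy$age > 20, ]        # rows selected by a logical condition
toy[, -1]                  # every column except the first
toy$age[3]                 # third element of one column
```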

Exercises

  • Suppose that you want to know a few things from the dataset:
    • What’s the median income of this sample? What’s the mean age?
    • On average, do the young adults surveyed think money, no cheating, or love is more important in a relationship?
    • What are the answer choices for debt?
  • hint: for simple calculations, you can use the base R functions mean(), median(), and table()
Code
median(addh$income)
[1] 15127.34
Code
mean(addh$age)
[1] 22.51133
Code
unique(addh$debt)
[1] "yesdebt" "nodebt" 
Code
# as a precursor to the next section, we can use dplyr to do the same thing
# note: summarise() computes the requested statistics; summary() would ignore these arguments

library(dplyr)
summarise(addh,
          mean_money = mean(money),
          mean_nocheating = mean(nocheating),
          mean_love = mean(love))
# A tibble: 1 × 3
  mean_money mean_nocheating mean_love
       <dbl>           <dbl>     <dbl>
1       5.57            7.69      7.71

Data Manipulation

  • The dplyr package is part of the tidyverse and is used for data manipulation

  • dplyr functions include:

    • filter(): subset rows
    • select(): subset columns
    • mutate(): create new variables
    • summarise(): summarize data
    • arrange(): sort data
    • group_by(): group data
  • very important function: pipe operator %>% from the magrittr package

    • allows you to chain functions together
  • basic structure of the dplyr functions

    • function(dataframe, operation 1 to perform, operation 2 to perform, …)
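
As a rough illustration of the structure above, the same two steps can be written as nested calls or as a pipe chain. The tibble toy is a hypothetical stand-in for addh; both versions should give the same result, and the piped one reads from top to bottom.

```r
library(dplyr)

# Hypothetical toy data (not the Add Health data)
toy <- tibble(id = 1:5,
              age = c(18, 22, 18, 26, 27),
              money = c(9, 3, 5, 3, 9))

# nested calls: read from the inside out
nested <- arrange(filter(toy, age >= 20), money)

# piped calls: the result of each step becomes the first argument of the next
piped <- toy %>%
  filter(age >= 20) %>%
  arrange(money)

identical(nested, piped)
```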

Data Manipulation

dplyr - select

  • select() keeps or drops columns; it can be combined with helper functions such as
    • starts_with()
    • ends_with()
    • contains()
    • matches()
    • etc…
Code
pay_variables <- select(addh, contains("pay"))

head(pay_variables, 5)
# A tibble: 5 × 2
  paypercent logpaypercent
       <dbl>         <dbl>
1         46          3.83
2         56          4.03
3         42          3.74
4         82          4.41
5         93          4.53

Data Manipulation

dplyr - filter

  • filter rows based on conditions
Code
nodebt_income <- filter(addh, debt == "nodebt" & income >= 10000)

nrow(nodebt_income)
[1] 1096
Code
nomissing_income <- filter(addh, !is.na(income)) # only keep obs that are NOT (!) na

#nomissing_income <- drop_na(addh, income) # alternate function from tidyr

nrow(nomissing_income)
[1] 3000

Data Manipulation

dplyr - arrange

  • sort data based on one or more columns

  • task: find the two respondents who think money is extremely important for a relationship (10 on money) but who pay for the smallest percentage of dates (paypercent)

Code
addh %>%
  filter(money == 10) %>%
  arrange(paypercent) %>%
  head(2)
# A tibble: 2 × 11
     id   age gender income logincome debt     love nocheating money paypercent
  <dbl> <dbl> <chr>   <dbl>     <dbl> <chr>   <dbl>      <dbl> <dbl>      <dbl>
1   811    22 male   34161.     10.4  yesdebt    10          9    10          2
2  2086    20 male    4816.      8.48 yesdebt    10         10    10          2
# ℹ 1 more variable: logpaypercent <dbl>

Data Manipulation

dplyr - mutate

  • create new variables added to the dataset

  • task: add a variable with the average rating for nocheating, money, and love’s importance for a relationship (sum divided by 3) and another variable that logs that rating

Code
addh<- mutate(addh,
              rateavg=(love + money + nocheating)/3,
              rateavglog=log(rateavg))

head(addh, 5)
# A tibble: 5 × 13
     id   age gender income logincome debt     love nocheating money paypercent
  <dbl> <dbl> <chr>   <dbl>     <dbl> <chr>   <dbl>      <dbl> <dbl>      <dbl>
1     1    18 female 19252.      9.87 yesdebt     1          7     9         46
2     2    22 male   11617.      9.36 nodebt     10         10     3         56
3     3    18 female 16189.      9.69 yesdebt    10          3     5         42
4     4    26 female 18194.      9.81 yesdebt     2          1     3         82
5     5    27 female 24484.     10.1  yesdebt     5         10     9         93
# ℹ 3 more variables: logpaypercent <dbl>, rateavg <dbl>, rateavglog <dbl>
  • Caution
    • if you assign the result to an existing object name or reuse an existing column name, the original object or column is overwritten

Data Manipulation

dplyr - group_by and summarise

  • group data by one or more variables and then summarize the data according to the groups

  • task: find the average “not cheating importance” by gender

Code
addh %>% 
  group_by(gender) %>% 
  summarize(mean_nocheating = mean(nocheating))
# A tibble: 2 × 2
  gender mean_nocheating
  <chr>            <dbl>
1 female            7.79
2 male              7.60
  • summarise() works with many functions for creating summary statistics
    • mean(), median(), min(), max(), sd(), n(), n_distinct(), first(), last(), etc…
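
Several of these summary functions can be combined in a single summarise() call. The following is a minimal sketch on a hypothetical five-row tibble; the column names n_obs, n_unique, and first_love are chosen here for illustration only.

```r
library(dplyr)

# Hypothetical toy data (not the Add Health data)
toy <- tibble(gender = c("female", "female", "male", "male", "male"),
              love   = c(10, 8, 10, 10, 4))

toy_summary <- toy %>%
  group_by(gender) %>%
  summarise(n_obs      = n(),               # rows in each group
            mean_love  = mean(love),        # group mean
            sd_love    = sd(love),          # group standard deviation
            n_unique   = n_distinct(love),  # number of distinct ratings
            first_love = first(love))       # first rating in each group

toy_summary
```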

Exercises

  1. Exercise one
    1. the number of females and males by debt status
    2. the percentage in each (debt x gender) category as a fraction of all observations
    3. the number of distinct ratings of love’s importance in each of these debt x gender categories
Code
addh %>% 
  group_by(gender, debt) %>%
  summarize(percentage = n()/nrow(addh),
            n_distinct_love = n_distinct(love))
# A tibble: 4 × 4
# Groups:   gender [2]
  gender debt    percentage n_distinct_love
  <chr>  <chr>        <dbl>           <int>
1 female nodebt       0.256              10
2 female yesdebt      0.245              10
3 male   nodebt       0.248              10
4 male   yesdebt      0.251              10

Exercises

  1. Exercise two
  • Group the data by gender and debt status first
    • Find the average rating of love, no cheating, and money’s importance for a relationship in each group
    • Arrange the groups by their rating of money’s importance to a relationship, from the highest rating to the lowest
Code
addh %>% 
  group_by(gender, debt) %>%
  summarize(mean_love = mean(love),
            mean_nocheating = mean(nocheating),
            mean_money = mean(money)) %>%
  arrange(desc(mean_money))
# A tibble: 4 × 5
# Groups:   gender [2]
  gender debt    mean_love mean_nocheating mean_money
  <chr>  <chr>       <dbl>           <dbl>      <dbl>
1 male   yesdebt      7.76            7.72       5.66
2 female yesdebt      7.57            7.75       5.59
3 female nodebt       7.82            7.83       5.54
4 male   nodebt       7.68            7.47       5.49

Recoding Variables

  • Recoding variables is a common task in data analysis for social science research
    • convert a variable from one format to another
    • create a new variable based on the values of an existing variable (or of multiple existing variables)
  • some typical recoding tasks
    • convert a continuous variable to a categorical variable
    • convert a categorical variable to a continuous variable
    • create categorical variables based on conditions
  • Our focus
    • data types
    • logical statements
    • case_when() function

Recoding Variables

changing data types

  • use the mutate() function to change the data type of a variable
    • as.character()
    • as.numeric()
    • as.factor()
    • as.integer()
    • as.logical()
Code
addh2 <- addh %>% 
            mutate(
               age = as.character(age),
               debt = as.factor(debt)
               )

head(addh2, 3)
# A tibble: 3 × 13
     id age   gender income logincome debt     love nocheating money paypercent
  <dbl> <chr> <chr>   <dbl>     <dbl> <fct>   <dbl>      <dbl> <dbl>      <dbl>
1     1 18    female 19252.      9.87 yesdebt     1          7     9         46
2     2 22    male   11617.      9.36 nodebt     10         10     3         56
3     3 18    female 16189.      9.69 yesdebt    10          3     5         42
# ℹ 3 more variables: logpaypercent <dbl>, rateavg <dbl>, rateavglog <dbl>

Recoding Variables

create a vector

  • use c() to combine the elements into a vector
Code
agevec<- c(18, 21, 23, 25, 27, 30)

agevec
[1] 18 21 23 25 27 30
Code
class(agevec)
[1] "numeric"
Code
gendervec <- c("male", "female", "other", "female", "female", "male")
gendervec
[1] "male"   "female" "other"  "female" "female" "male"  
Code
class(gendervec)
[1] "character"

Recoding Variables

create a vector

  • Elements in a vector need to be of the same type, otherwise, type coercion happens
Code
c(28, "28", TRUE)
[1] "28"   "28"   "TRUE"
Code
c(28, "28", TRUE) %>% class()
[1] "character"
Code
c(1,2,3, TRUE, FALSE)
[1] 1 2 3 1 0
Code
c(1,2,3, TRUE, FALSE) %>% class()
[1] "numeric"

Recoding Variables

data types

  • convert from one type to another using the following functions:
    • as.numeric()
    • as.character()
    • as.factor()
Code
as.character(agevec)
[1] "18" "21" "23" "25" "27" "30"
Code
as.numeric(gendervec)
[1] NA NA NA NA NA NA
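
The NA values above arise because the strings are not numbers. As a small sketch with a hypothetical vector numstrings, strings that look like numbers do convert, and only the entries that cannot be parsed become NA (R also issues a coercion warning, which is suppressed here).

```r
# Hypothetical character vector mixing numeric-looking and other strings
numstrings <- c("28", "3.5", "thirty")

# suppressWarnings() hides the "NAs introduced by coercion" warning
converted <- suppressWarnings(as.numeric(numstrings))
converted

# logical values convert to 1 (TRUE) and 0 (FALSE)
as.numeric(c(TRUE, FALSE, TRUE))

# count how many entries could not be converted
sum(is.na(converted))
```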

Recoding Variables

data types

  • Vectors can have a factor type
    • looks like a character vector
    • but is actually stored as integer codes under the hood (“labelled data”)
Code
genderfactorvec<- factor(gendervec,
                         levels=c("male", "female", "other"))

genderfactorvec
[1] male   female other  female female male  
Levels: male female other
Code
class(genderfactorvec)
[1] "factor"
Code
as.numeric(genderfactorvec)
[1] 1 2 3 2 2 1
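
Because as.numeric() returns the integer codes of a factor, a factor whose labels are themselves numbers needs an extra step. The sketch below uses a hypothetical factor yearfactor and goes through as.character() first to recover the labels.

```r
# Hypothetical factor whose labels happen to be numbers
yearfactor <- factor(c("2020", "2010", "2020", "2015"))

levels(yearfactor)                     # levels are sorted alphabetically
as.numeric(yearfactor)                 # integer codes, not the years
as.numeric(as.character(yearfactor))   # the years themselves
```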

Recoding Variables

create a vector

  • there are functions to help you create the vector more efficiently:
    1. rep: repeat the same thing multiple times
    2. seq: create a sequence of numbers
    3. paste: stick together character and numeric info
    4. sample: for vectors where we want to randomly sample from some larger pool
Code
rep(1, 5)
[1] 1 1 1 1 1
Code
seq(from=1997, to=2024, by=5)
[1] 1997 2002 2007 2012 2017 2022
Code
paste("age", seq(from=22, to=30, by=1),
      sep="_")
[1] "age_22" "age_23" "age_24" "age_25" "age_26" "age_27" "age_28" "age_29"
[9] "age_30"
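
The fourth helper in the list, sample(), is not shown above. A minimal sketch follows; the seed, pool, and probabilities are arbitrary choices made for illustration. The drawn values depend on the seed and the R version, so no output is shown.

```r
set.seed(2025)  # arbitrary seed so the draws can be reproduced

# draw 3 distinct ages from 18 to 27 (sampling without replacement is the default)
ages_drawn <- sample(18:27, size = 3)

# draw 10 labels with replacement and unequal probabilities
debt_drawn <- sample(c("yesdebt", "nodebt"), size = 10,
                     replace = TRUE, prob = c(0.3, 0.7))

ages_drawn
debt_drawn
```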

Recoding Variables

from vectors to dataframes

  • One way to create a dataframe
    • use bind_cols() to attach same-length vectors together as columns in a tibble
    • Vectors can be different types
Code
bind_cols(age=agevec, gender=gendervec)
# A tibble: 6 × 2
    age gender
  <dbl> <chr> 
1    18 male  
2    21 female
3    23 other 
4    25 female
5    27 female
6    30 male  

Exercises

  • convert the variable gender in addh to a factor variable
Code
addh2<- addh %>% 
            mutate(gender=factor(gender,
                                 levels=c("male", "female")
                                 )
               )

str(addh2$gender)
 Factor w/ 2 levels "male","female": 2 1 2 2 2 2 1 2 1 2 ...
  • what happens if you convert the variable to character with as.character() after the factor conversion?

  • what happens if you convert the variable to a number with as.numeric() after the factor conversion?
Code
vec1<- as.character(addh2$gender)
vec2<- as.numeric(addh2$gender)

Recoding Variables

Matrices in R

  • A matrix is a two-dimensional array
    • all elements must be of the same type
    • can be created using the matrix() function
    • can be created from a vector by assigning its dimensions, e.g. dim(x) <- c(3, 5)
Code
matrix1<- matrix(1:15, nrow=3, ncol=5)

matrix1
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    4    7   10   13
[2,]    2    5    8   11   14
[3,]    3    6    9   12   15
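
Creating a matrix from a vector through dim() can be sketched as follows. The vector name v is hypothetical; the assignment form dim(v) <- c(rows, columns) fills the matrix column by column, just as matrix() does by default.

```r
v <- 1:15
dim(v) <- c(3, 5)   # reshape in place: 3 rows, 5 columns, filled by column

# same values as the matrix() call above
all(v == matrix(1:15, nrow = 3, ncol = 5))

# byrow = TRUE fills row by row instead
byrow_matrix <- matrix(1:15, nrow = 3, byrow = TRUE)
byrow_matrix
```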

Recoding Variables

Matrices in R - some basic operations

Code
dim(matrix1)
[1] 3 5
Code
colnames(matrix1)<- c("A", "B", "C", "D", "E")
rownames(matrix1)<- c("X", "Y", "Z")

matrix1
  A B C  D  E
X 1 4 7 10 13
Y 2 5 8 11 14
Z 3 6 9 12 15
Code
matrix1[2,3]
[1] 8
Code
matrix1[2,]
 A  B  C  D  E 
 2  5  8 11 14 

Recoding Variables

Matrices in R - some basic operations

Code
A<- matrix(1:6, nrow=2, ncol=3)
B<- matrix(7:12, nrow=3, ncol=2)


print(A %*% B)
     [,1] [,2]
[1,]   76  103
[2,]  100  136
Code
print(B %*% A)
     [,1] [,2] [,3]
[1,]   27   61   95
[2,]   30   68  106
[3,]   33   75  117
Code
print(t(A))
     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6

Recoding Variables

Matrices in R - some basic operations

Code
C<- matrix(c(2,5,3,1,3,6,2,9,5), nrow=3, ncol=3)
print(solve(C))
            [,1]       [,2]        [,3]
[1,]  1.14705882 -0.2058824 -0.08823529
[2,] -0.05882353 -0.1176471  0.23529412
[3,] -0.61764706  0.2647059 -0.02941176
Code
print(det(C))
[1] -34
Code
print(eigen(C))
eigen() decomposition
$values
[1] 12.502029 -3.320941  0.818912

$vectors
           [,1]        [,2]       [,3]
[1,] -0.1946720 -0.07646107 -0.8799055
[2,] -0.7265653 -0.79546586  0.1194865
[3,] -0.6589428  0.60115536  0.4598796

Recoding Variables

logical statements

  • let us start with the logical operators

  • the main logical operators used in R are:

    • == (equal to)
    • != (not equal to)
    • ! (not)
    • < (less than)
    • <= (less than or equal to)
    • > (greater than)
    • >= (greater than or equal to)
    • & (and)
    • | (or)
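
These operators work element by element on vectors and return TRUE or FALSE for each position. A short sketch with two made-up vectors, age and money:

```r
# Hypothetical vectors (values made up for illustration)
age   <- c(18, 22, 25, 27)
money <- c(9, 3, 10, 5)

age >= 22                   # element-wise comparison
age >= 22 & money > 4       # both conditions must hold
age < 20 | money == 10      # at least one condition holds
!(age >= 22)                # negation
sum(age >= 22 & money > 4)  # TRUE counts as 1, so this counts the matches
age[money != 3]             # a logical vector can subset another vector
```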

Recoding Variables

logical statements

  • logical statements are used to filter data, create new variables, and recode variables
    • ifelse() function
Code
addh2<- addh %>% 
            mutate(
               money_over_love = ifelse(money > love, 1, 0),
               .after = id
               )

head(addh2, 5)
# A tibble: 5 × 14
     id money_over_love   age gender income logincome debt     love nocheating
  <dbl>           <dbl> <dbl> <chr>   <dbl>     <dbl> <chr>   <dbl>      <dbl>
1     1               1    18 female 19252.      9.87 yesdebt     1          7
2     2               0    22 male   11617.      9.36 nodebt     10         10
3     3               0    18 female 16189.      9.69 yesdebt    10          3
4     4               1    26 female 18194.      9.81 yesdebt     2          1
5     5               1    27 female 24484.     10.1  yesdebt     5         10
# ℹ 5 more variables: money <dbl>, paypercent <dbl>, logpaypercent <dbl>,
#   rateavg <dbl>, rateavglog <dbl>
Code
addh3<- addh %>% 
            mutate(
               money_or_love = ifelse(money==love, "same",
                                      ifelse(love > money, "love greater", "money greater")),
               .after = id
               )

head(addh3, 5)
# A tibble: 5 × 14
     id money_or_love   age gender income logincome debt   love nocheating money
  <dbl> <chr>         <dbl> <chr>   <dbl>     <dbl> <chr> <dbl>      <dbl> <dbl>
1     1 money greater    18 female 19252.      9.87 yesd…     1          7     9
2     2 love greater     22 male   11617.      9.36 node…    10         10     3
3     3 love greater     18 female 16189.      9.69 yesd…    10          3     5
4     4 money greater    26 female 18194.      9.81 yesd…     2          1     3
5     5 money greater    27 female 24484.     10.1  yesd…     5         10     9
# ℹ 4 more variables: paypercent <dbl>, logpaypercent <dbl>, rateavg <dbl>,
#   rateavglog <dbl>

Recoding Variables

logical statements

  • use case_when() when a new variable depends on three or more conditions

  • its syntax is the following:

    • case_when(condition 1 ~ value 1, condition 2 ~ value 2, ..., .default = value used when no condition matches)
    • conditions are checked in order, and the first one that is TRUE determines the value (older code uses TRUE ~ value as the fallback)
Code
addh3<- addh %>% 
  mutate(
    money_or_love = case_when(
      money==love ~ "same",
      love > money ~ "love greater",
      TRUE ~ "money greater"
    ),
    .after = id
  )

head(addh3, 5)
# A tibble: 5 × 14
     id money_or_love   age gender income logincome debt   love nocheating money
  <dbl> <chr>         <dbl> <chr>   <dbl>     <dbl> <chr> <dbl>      <dbl> <dbl>
1     1 money greater    18 female 19252.      9.87 yesd…     1          7     9
2     2 love greater     22 male   11617.      9.36 node…    10         10     3
3     3 love greater     18 female 16189.      9.69 yesd…    10          3     5
4     4 money greater    26 female 18194.      9.81 yesd…     2          1     3
5     5 money greater    27 female 24484.     10.1  yesd…     5         10     9
# ℹ 4 more variables: paypercent <dbl>, logpaypercent <dbl>, rateavg <dbl>,
#   rateavglog <dbl>

Exercises

  • create a new variable called money_or_love in the addh dataset
    • the variable should have the following categories:
      • “extreme” if person either codes love or money as 9 or 10
      • “lovegreater” if love > money
      • “same” if love == money
      • “moneygreater” if money > love
      • NA if none of the above
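
One possible way to approach this exercise is sketched below on a hypothetical five-row tibble rather than on addh. It assumes a dplyr version that supports the .default argument, and it puts the "extreme" condition first because case_when() uses the first condition that is TRUE. A row with a missing rating matches none of the conditions and falls through to NA.

```r
library(dplyr)

# Hypothetical toy data (not the Add Health data); the last row has a missing rating
toy <- tibble(love  = c(10, 5, 3, 2, NA),
              money = c(1, 2, 3, 8, 4))

toy <- toy %>%
  mutate(money_or_love = case_when(
    love >= 9 | money >= 9 ~ "extreme",       # checked first, so it takes priority
    love > money           ~ "lovegreater",
    love == money          ~ "same",
    money > love           ~ "moneygreater",
    .default = NA_character_                  # anything else, e.g. missing ratings
  ))

toy
```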

Recoding Variables

create a binary vector

Code
income_75<- quantile(addh$income)[4]

addh2<- addh %>% 
  mutate(high_income=ifelse(income > income_75, 1, 0))

income_25 <- quantile(addh$income)[2]

addh2<- addh2 %>% 
  mutate(income_level=case_when(income <= income_25 ~ "low",
                                income >=income_75 ~"high", 
                                .default="medium"),
         .after = id)

head(addh2, 5)
# A tibble: 5 × 15
     id income_level   age gender income logincome debt    love nocheating money
  <dbl> <chr>        <dbl> <chr>   <dbl>     <dbl> <chr>  <dbl>      <dbl> <dbl>
1     1 medium          18 female 19252.      9.87 yesde…     1          7     9
2     2 medium          22 male   11617.      9.36 nodebt    10         10     3
3     3 medium          18 female 16189.      9.69 yesde…    10          3     5
4     4 medium          26 female 18194.      9.81 yesde…     2          1     3
5     5 high            27 female 24484.     10.1  yesde…     5         10     9
# ℹ 5 more variables: paypercent <dbl>, logpaypercent <dbl>, rateavg <dbl>,
#   rateavglog <dbl>, high_income <dbl>

Looping

  • Loops are used to repeat a block of code multiple times
    • for loop
    • while loop
    • repeat loop
    • break and next statements
  • Loops are useful for:
    • automating repetitive tasks
    • iterating over elements in a list or vector
    • creating new variables or dataframes
    • running simulations
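
The loop types other than for are not demonstrated in the slides that follow, so here is a small sketch of while, repeat with break, and next. All variable names are made up for illustration.

```r
# while: keep doubling until the value reaches at least 100
x <- 1
steps <- 0
while (x < 100) {
  x <- x * 2
  steps <- steps + 1
}

# repeat with break: the exit condition is checked inside the loop
total <- 0
k <- 0
repeat {
  k <- k + 1
  total <- total + k
  if (total > 20) break
}

# next: skip even numbers and keep only the odd ones
odds <- c()
for (i in 1:10) {
  if (i %% 2 == 0) next
  odds <- c(odds, i)
}
```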

For loops

  • The for loop is the most common type of loop in R
    • it repeats a block of code a specified number of times
    • it can go through every element of a vector
      • syntax: for (i in vector) {code to execute}
    • it can iterate through a set number of elements in a vector
      • syntax: for (i in 1:length(vector)) {code to execute}
  • The for loop is useful for:
    • creating new variables
    • running simulations
    • iterating over elements in a list or vector
    • automating repetitive tasks
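
The two for loop forms above can be sketched with a made-up vector ages. The example also shows why seq_along(vector) is often preferred over 1:length(vector): with an empty vector, 1:length(vector) counts down from 1 to 0 instead of producing nothing.

```r
# Hypothetical vector (values made up for illustration)
ages <- c(18, 21, 23)

# loop over the elements themselves
for (a in ages) {
  print(a + 1)
}

# loop over positions, which allows writing into a pre-allocated result
ages_next <- numeric(length(ages))
for (i in seq_along(ages)) {
  ages_next[i] <- ages[i] + 1
}

# with an empty vector, 1:length(x) gives the two values 1 and 0
empty <- numeric(0)
1:length(empty)
seq_along(empty)    # no values, so a loop body would be skipped
```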

For loops — sample means

Code
set.seed(123456)

sample_means<- numeric(length=1000)

for (i in seq_along(sample_means)) {
  samp<- sample(addh3$money, size = 800, replace=TRUE)
  sample_means[i]<- mean(samp)
}

mean(sample_means)
[1] 5.57076
Code
library(tidyverse)

ggplot(as.data.frame(sample_means), aes(sample_means)) +
  #geom_histogram(bins=30)+
  geom_density()+
  geom_vline(xintercept = mean(sample_means),
             color="red", linetype="dashed") +
  labs(title="Distribution of Sample Means",
       x="Sample Means",
       y="Density")+
  theme_minimal()

Write a function

  • Functions are blocks of code that can be reused
    • they take input arguments
    • they return output
    • they can be used in loops, apply functions, and other functions
    • they can be used to create new variables, summarize data, and run simulations
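  • the basic structure of a function is:
    • function_name <- function(input arguments) {code to execute; return(output)}
    • return(output) goes inside the braces; without it, the value of the last evaluated expression is returned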
  • we now touch upon some basics
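
As a minimal sketch of that structure, the hypothetical helper rescale01() below takes an input vector plus an argument with a default value and returns the rescaled vector. The function name and the choice of default are assumptions made for illustration.

```r
# Hypothetical helper: rescale a vector to the 0-1 range
rescale01 <- function(x, na.rm = TRUE) {
  lo <- min(x, na.rm = na.rm)   # smallest value, ignoring NA by default
  hi <- max(x, na.rm = na.rm)   # largest value, ignoring NA by default
  return((x - lo) / (hi - lo))
}

rescale01(c(1, 5, 10))
rescale01(c(2, NA, 4, 6))   # the NA stays NA, the other values are rescaled
```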

functions - basics

  • Let us start with a simple function that computes the z-score of a variable
Code
zscore<- function(x) {
  zscore<- (x - mean(x))/sd(x)
  return(zscore)
}

z_income<- zscore(addh$income)

head(z_income, 10)
 [1]  0.5221372 -0.4693026  0.1243644  0.3847371  1.2013581  0.9246830
 [7] -0.4400798  0.6028251  1.5828892 -0.6833786
  • more complicated example
Code
sample_means<- function(data, n, reps) {
  sample_means<- numeric(length=reps)
  
  for (i in seq_along(sample_means)) {
    samp<- sample(data, size = n, replace=TRUE)
    sample_means[i]<- mean(samp)
  }
  
  return(sample_means)
}
Code
age_sample_means<- sample_means(addh$age, 500, 1000)

ggplot(as.data.frame(age_sample_means), aes(age_sample_means)) +
  geom_density()+
  geom_vline(xintercept = mean(age_sample_means),
             color="red", linetype="dashed") +
  labs(title="Distribution of Sample Means",
       x="Sample Means of Age",
       y="Density")+
  theme_bw()

Exercises

  • create a function that incorporates both the sample means function and the density plot above

  • plot the distribution of sample means for the variable love in the addh dataset using the function

Lists and map functions from purrr library

  • Lists are a way to store multiple objects in R
    • can store vectors, dataframes, and other lists
    • can store objects of different classes
Code
list1<- list(1, c(8,9,10,11,12), data.frame(x=1:10, y=11:20))

list1
[[1]]
[1] 1

[[2]]
[1]  8  9 10 11 12

[[3]]
    x  y
1   1 11
2   2 12
3   3 13
4   4 14
5   5 15
6   6 16
7   7 17
8   8 18
9   9 19
10 10 20

Lists and map functions from purrr library

  • The syntax is map(mylist, myfunction, functionoptions); map() always returns a list, while variants such as map_dbl() and map_chr() return a vector of the given type
Code
library(purrr)

map(list1, length)
[[1]]
[1] 1

[[2]]
[1] 5

[[3]]
[1] 2
Code
map(list1, class)
[[1]]
[1] "numeric"

[[2]]
[1] "numeric"

[[3]]
[1] "data.frame"
Code
map(list1, summary)
[[1]]
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      1       1       1       1       1       1 

[[2]]
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      8       9      10      10      11      12 

[[3]]
       x               y        
 Min.   : 1.00   Min.   :11.00  
 1st Qu.: 3.25   1st Qu.:13.25  
 Median : 5.50   Median :15.50  
 Mean   : 5.50   Mean   :15.50  
 3rd Qu.: 7.75   3rd Qu.:17.75  
 Max.   :10.00   Max.   :20.00  
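
When every element yields a single value, the typed variants of map() return a vector instead of a list. The sketch below uses a hypothetical named list list2 and also shows how extra arguments (here na.rm = TRUE) are passed on to the function.

```r
library(purrr)

# Hypothetical named list (values made up for illustration)
list2 <- list(a = c(1, 2, 3), b = c(10, NA, 30), c = 5)

map_int(list2, length)                       # integer vector instead of a list
map_dbl(list2, mean, na.rm = TRUE)           # extra arguments are passed on to mean()
map_chr(list2, class)                        # character vector
map_int(list2, function(v) sum(!is.na(v)))   # an anonymous function also works
```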

Tidy approach to data

  • The tidy approach (Wickham (2014)) to data is a way to organize data in a consistent format
    • each variable is a column
    • each observation is a row
    • each type of observational unit is a table
    • each value is a cell

Reshape the data

let us start with a small “real” example

Code
library(tidyr)

sleep_wide<- tibble(name=c("KT", "Olivia", "Dean"),
                    day1=c(8, 7, 6),
                    day2=c(6, 6, 5),
                    day3=c(5, 4, 4))

sleep_wide
# A tibble: 3 × 4
  name    day1  day2  day3
  <chr>  <dbl> <dbl> <dbl>
1 KT         8     6     5
2 Olivia     7     6     4
3 Dean       6     5     4
Code
sleep_long<- sleep_wide %>% 
  pivot_longer(cols=day1:day3,
               names_to="day",
               values_to="sleep_hours")
  


sleep_long
# A tibble: 9 × 3
  name   day   sleep_hours
  <chr>  <chr>       <dbl>
1 KT     day1            8
2 KT     day2            6
3 KT     day3            5
4 Olivia day1            7
5 Olivia day2            6
6 Olivia day3            4
7 Dean   day1            6
8 Dean   day2            5
9 Dean   day3            4
Code
library(stringr)

sleep_long %>% 
  mutate(day=stringr::str_sub(day,-1, -1))
# A tibble: 9 × 3
  name   day   sleep_hours
  <chr>  <chr>       <dbl>
1 KT     1               8
2 KT     2               6
3 KT     3               5
4 Olivia 1               7
5 Olivia 2               6
6 Olivia 3               4
7 Dean   1               6
8 Dean   2               5
9 Dean   3               4
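
As an alternative to trimming the day label with str_sub() afterwards, pivot_longer() can strip the prefix and convert the type in one step. This is a sketch under the assumption that a reasonably recent tidyr (1.1.0 or later) is installed; `sleep_wide_demo` and `sleep_long_demo` are hypothetical names used only here.

```r
library(tidyr)

# Hypothetical copy of the wide sleep data, so the block runs on its own
sleep_wide_demo <- tibble(name = c("KT", "Olivia", "Dean"),
                          day1 = c(8, 7, 6),
                          day2 = c(6, 6, 5),
                          day3 = c(5, 4, 4))

sleep_long_demo <- pivot_longer(
  sleep_wide_demo,
  cols = day1:day3,
  names_to = "day",
  names_prefix = "day",                      # drop the leading "day" text
  names_transform = list(day = as.integer),  # store day as an integer
  values_to = "sleep_hours"
)

sleep_long_demo
```

Storing day as an integer rather than a character string keeps it sortable and usable on a numeric axis.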

Reshape the data

from long to wide

Code
sleep_wide2<- sleep_long %>% 
  pivot_wider(names_from=day,
              values_from=sleep_hours)

sleep_wide2
# A tibble: 3 × 4
  name    day1  day2  day3
  <chr>  <dbl> <dbl> <dbl>
1 KT         8     6     5
2 Olivia     7     6     4
3 Dean       6     5     4

Exercises

Code
sleep_wide2<- tibble(name=c(rep("KT",2), rep("Olivia",2), rep("Dean",2)),
                     activity=rep(c("sleep", "play"),3),
                     day1=c(8, 2, 7, 2, 5, 3),
                     day2=c(6, 1, 1, 3, 6, 2),
                     day3=c(5, 1, 4, 1, 4, 3))

sleep_wide2
# A tibble: 6 × 5
  name   activity  day1  day2  day3
  <chr>  <chr>    <dbl> <dbl> <dbl>
1 KT     sleep        8     6     5
2 KT     play         2     1     1
3 Olivia sleep        7     1     4
4 Olivia play         2     3     1
5 Dean   sleep        5     6     4
6 Dean   play         3     2     3
  • how to express the data in a tidy format?
    • i.e. name, day, sleep, play as columns
Code
sleep_tidy<- sleep_wide2 %>% 
  pivot_longer(cols=day1:day3,
               names_to="day",
               values_to="hours") %>% 
  mutate(day=stringr::str_sub(day, -1, -1)) %>% 
  pivot_wider(names_from=activity,
              values_from=hours)

sleep_tidy
# A tibble: 9 × 4
  name   day   sleep  play
  <chr>  <chr> <dbl> <dbl>
1 KT     1         8     2
2 KT     2         6     1
3 KT     3         5     1
4 Olivia 1         7     2
5 Olivia 2         1     3
6 Olivia 3         4     1
7 Dean   1         5     3
8 Dean   2         6     2
9 Dean   3         4     3

Export data

  • The last step in data analysis is to export the data
    • save the data in a format that can be shared with others
    • save the data in a format that can be read by other software
  • Export command depends on the type of file you are trying to write to
    • write_csv (readr) or write.csv (base R) for CSV
    • write.xlsx (openxlsx) for Excel
    • write_dta (haven) or write.dta (foreign) for Stata file
    • etc
  • By default, the new file will be saved in current working directory
    • If you want to save it elsewhere, need to specify the path
Code
library(readr)

write_csv(sleep_tidy, "sleep_tidy.csv")

library(haven)
write_dta(sleep_tidy, "c:/Users/KT/Downloads/sleep_tidy.dta")
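
A quick way to check an export is to read the file back and compare it with the original. This is a minimal sketch, assuming readr and tibble are installed; `demo_tbl` and the temporary file path are hypothetical and are used so that nothing is written to the working directory.

```r
library(readr)

# Hypothetical small table to export
demo_tbl <- tibble::tibble(name = c("KT", "Olivia"), sleep = c(8, 7))

# Write to a temporary location instead of the working directory
csv_path <- file.path(tempdir(), "demo_tbl.csv")
write_csv(demo_tbl, csv_path)

# Read the file back and compare with the original
demo_back <- read_csv(csv_path, show_col_types = FALSE)
all.equal(demo_back$name, demo_tbl$name)
all.equal(demo_back$sleep, demo_tbl$sleep)
```

Building the path with file.path() avoids hard-coding a separator, so the same code should work on Windows, macOS and Linux.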

Basic merge

  • The typical merge in R is left_join (from dplyr)
    • keeps all rows from the “left” table, even if an observation has no matching row in the “right” table (the missing columns are filled with NA)
    • drops rows of the “right” table that have no match in the “left” table
Code
sleep_tidy<- tibble(name=c("KT", "Olivia", "Dean", "May", "Mary"),
                     sleep=c(8, 7, 6, 5, 5),
                     play=c(2, 2, 3, 3, 2))

sleep_tidy2<- tibble(name=c("KT", "Olivia", "Dean", "Peter", "Susan"),
                      study=c(3, 4, 5, 2, 3),
                      work=c(8, 10, 9, 6, 5))

sleep_tidy3<- left_join(sleep_tidy, sleep_tidy2, by="name")

sleep_tidy3
# A tibble: 5 × 5
  name   sleep  play study  work
  <chr>  <dbl> <dbl> <dbl> <dbl>
1 KT         8     2     3     8
2 Olivia     7     2     4    10
3 Dean       6     3     5     9
4 May        5     3    NA    NA
5 Mary       5     2    NA    NA
Code
sleep_tidy4<- left_join(sleep_tidy2, sleep_tidy, by="name")

sleep_tidy4
# A tibble: 5 × 5
  name   study  work sleep  play
  <chr>  <dbl> <dbl> <dbl> <dbl>
1 KT         3     8     8     2
2 Olivia     4    10     7     2
3 Dean       5     9     6     3
4 Peter      2     6    NA    NA
5 Susan      3     5    NA    NA

Merge data

  • inner join: only keep rows of the first data frame that have a matching record in the second data frame
Code
sleep_tidy5<- inner_join(sleep_tidy, sleep_tidy2, by="name", suffix=c("_sleep", "_work"))

sleep_tidy5
# A tibble: 3 × 5
  name   sleep  play study  work
  <chr>  <dbl> <dbl> <dbl> <dbl>
1 KT         8     2     3     8
2 Olivia     7     2     4    10
3 Dean       6     3     5     9
  • full join: keep all rows from both data frames, filling in missing values with NA
Code
sleep_tidy6<- full_join(sleep_tidy, sleep_tidy2, by="name")

sleep_tidy6
# A tibble: 7 × 5
  name   sleep  play study  work
  <chr>  <dbl> <dbl> <dbl> <dbl>
1 KT         8     2     3     8
2 Olivia     7     2     4    10
3 Dean       6     3     5     9
4 May        5     3    NA    NA
5 Mary       5     2    NA    NA
6 Peter     NA    NA     2     6
7 Susan     NA    NA     3     5
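
The difference between the three joins is easiest to see by counting rows. This is a small sketch, assuming dplyr is installed; `left_tbl` and `right_tbl` are hypothetical tables that share two of their three names.

```r
library(dplyr)

# Hypothetical tables: "KT" and "Olivia" appear in both, "May" and "Peter" in one only
left_tbl  <- tibble(name = c("KT", "Olivia", "May"),   sleep = c(8, 7, 5))
right_tbl <- tibble(name = c("KT", "Olivia", "Peter"), study = c(3, 4, 2))

n_left  <- nrow(left_join(left_tbl, right_tbl, by = "name"))   # all rows of left_tbl
n_inner <- nrow(inner_join(left_tbl, right_tbl, by = "name"))  # matched rows only
n_full  <- nrow(full_join(left_tbl, right_tbl, by = "name"))   # rows from either table

c(left = n_left, inner = n_inner, full = n_full)
```

Comparing the row count before and after a merge is a cheap check that the join behaved as intended.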

ggplot2

it starts from the grammar of graphics Wickham (2016)

  • data
  • aesthetics
  • geoms
  • facets
  • stats
  • scales
  • coordinates
  • themes

ggplot2

  • Every ggplot2 plot has three key components:
    • data
    • A set of aesthetic mappings between variables in the data and visual properties
    • At least one layer which describes how to render each observation
      • Layers are usually created with a geom function

ggplot2

illustration

  • Use built-in dataset from ggplot2: mpg
    • information about the fuel economy of popular car models in 1999 and 2008
    • collected by the US Environmental Protection Agency
    • here are some of the variables in the dataset:
      • manufacturer, model, year
      • displ (engine displacement in litres)
      • hwy (miles per gallon on the highway)
      • cty (miles per gallon in the city)
      • cyl (number of cylinders)
      • drv (f = front-wheel drive, r = rear wheel drive, 4 = 4wd)
      • class (type of car)
      • trans (type of transmission)
      • fl (fuel type)

ggplot2

  • Let us plot the relationship between engine size and fuel economy
Code
library(ggplot2)

data(mpg)

ggplot(data=mpg, mapping=aes(x=displ, y=hwy)) +
  geom_point()

  • How would you describe the relationship between displ and hwy?
Code
ggplot(mpg, aes(cty, hwy)) + geom_point()

Code
ggplot(diamonds, aes(carat, price)) + geom_point()

Code
ggplot(economics, aes(date, unemploy)) + geom_line()

Code
ggplot(mpg, aes(cty)) + geom_histogram()

ggplot2

Colour, size, shape and other aesthetic attributes

  • Aesthetics are visual properties of the objects in the plot
    • colour, size, shape, linetype, fill, alpha
  • Aesthetics can be mapped to variables in the data
    • aes(colour=variable)
    • aes(size=variable)
    • aes(shape=variable)
    • aes(linetype=variable)
    • aes(fill=variable)
    • aes(alpha=variable)

ggplot2

Colour, size, shape and other aesthetic attributes

Code
ggplot(mpg, aes(displ, hwy, colour=class)) +
  geom_point()

Code
ggplot(mpg, aes(trans, hwy, colour=class)) +
  geom_point()
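
Mapping an aesthetic to a variable is different from setting it to a fixed value. The sketch below contrasts the two, assuming ggplot2 is installed; the object names `p_mapped` and `p_set` are hypothetical.

```r
library(ggplot2)

# Mapped: colour varies with the class variable and a legend is drawn
p_mapped <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = class))

# Set: every point gets the same fixed colour, so no legend is needed
p_set <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(colour = "blue")

p_mapped
p_set
```

A fixed value belongs outside aes(); placing colour = "blue" inside aes() would treat "blue" as a data value and hand the colour choice to the scale.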

ggplot2

labels

  • Labels are important for making your plot understandable
    • xlab() and ylab() functions
    • labs() function
Code
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color=class)) +
  labs(x="Engine size (litres)",
       y="Highway fuel economy (miles per gallon)",
       title="Relationship between engine size and fuel economy",
       color="Car type",
       caption="Source: mpg dataset")+
  theme_bw()

ggplot2

ggthemes

Code
library(ggthemes)

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color=class)) +
  labs(x="Engine size (litres)",
       y="Highway fuel economy (miles per gallon)",
       title="Relationship between engine size and fuel economy",
       color="Car type",
       caption="Source: mpg dataset")+
  theme_economist()+
  scale_color_tableau()

ggplot2

Facets

  • Facets allow you to create multiple plots that each display a subset of the data
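    • facet_wrap() wraps a sequence of panels, defined by one variable, into rows and columns
    • facet_grid() lays out panels in a matrix defined by a row variable and a column variable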
Code
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_wrap(~class)
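
For comparison with facet_wrap(), here is a minimal facet_grid() sketch, assuming ggplot2 is installed. The choice of drv for rows and cyl for columns is only illustrative, and `p_grid` is a hypothetical name.

```r
library(ggplot2)

# Rows are defined by drv, columns by cyl (formula: rows ~ columns)
p_grid <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  facet_grid(drv ~ cyl)

p_grid
```

facet_grid() draws a panel for every row and column combination, so combinations with no observations appear as empty panels.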

ggplot2

Plot geoms

  • Geoms are the geometric objects that represent the data in the plot
    • geom_point() creates a scatterplot
    • geom_smooth() adds a smoothed line (with a confidence band by default)
    • geom_histogram() creates a histogram
    • geom_boxplot() creates a boxplot
    • geom_bar() creates a bar plot
    • geom_line() creates a line plot
    • geom_vline() adds a vertical line to the plot
    • geom_hline() adds a horizontal line to the plot
    • geom_abline() adds a line with a given slope and intercept
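
Geoms can be layered on top of each other. The sketch below adds two reference lines to a scatterplot, assuming ggplot2 is installed; the line styles and the name `p_ref` are illustrative choices.

```r
library(ggplot2)

p_ref <- ggplot(mpg, aes(cty, hwy)) +
  geom_point() +
  # line with slope 1 and intercept 0, i.e. where hwy equals cty
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
  # horizontal line at the mean highway mileage
  geom_hline(yintercept = mean(mpg$hwy), colour = "grey40")

p_ref
```

Layers are drawn in the order they are added, so the reference lines here sit on top of the points.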

ggplot2

Adding a smoother to a plot

Code
ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(span=0.3)
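
The smoother above uses a local fit whose wiggliness is controlled by span. A straight-line fit is a common alternative; the sketch below assumes ggplot2 is installed and states the formula explicitly, and `p_lm` is a hypothetical name.

```r
library(ggplot2)

# method = "lm" replaces the default local smoother with a linear regression line
p_lm <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

p_lm
```

Adding se = FALSE to geom_smooth() would hide the confidence band if only the fitted line is wanted.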

ggplot2

  • Boxplots, violin plots and jittered points are useful for comparing the distribution of a continuous variable across the levels of a categorical variable
Code
library(ggpubr)

# p_ prefixes avoid masking the base R functions jitter() and boxplot()
p_jitter<- ggplot(mpg, aes(drv, hwy)) + geom_jitter()
p_boxplot<- ggplot(mpg, aes(drv, hwy)) + geom_boxplot()
p_violin<- ggplot(mpg, aes(drv, hwy)) + geom_violin()

ggarrange(p_jitter, p_boxplot, p_violin, ncol=3)

ggplot2

Boxplots

Code
ggplot(mpg, aes(class, hwy)) +
  geom_boxplot()+
  labs(title="Highway fuel economy by car type",
       x="Car type",
       y="Highway fuel economy (miles per gallon)")+
  coord_flip()+
  theme_economist()

ggplot2

Bar plots

  • Bar plots are useful for visualizing the distribution of a categorical variable
Code
ggplot(mpg, aes(class)) +
  geom_bar()

Code
ggplot(mpg, aes(class, fill=drv)) +
  geom_bar()
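
By default the filled bars are stacked. The position argument changes how the groups are arranged; the sketch below assumes ggplot2 is installed, and `p_dodge` and `p_fill` are hypothetical names.

```r
library(ggplot2)

# Side-by-side bars: compare counts of each drive type within a class
p_dodge <- ggplot(mpg, aes(class, fill = drv)) +
  geom_bar(position = "dodge")

# Bars rescaled to the same height: compare proportions rather than counts
p_fill <- ggplot(mpg, aes(class, fill = drv)) +
  geom_bar(position = "fill")

p_dodge
p_fill
```

With position = "fill" the y axis shows shares between 0 and 1, so differences in the number of cars per class are no longer visible.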

ggplot2

Histograms and density plots

  • Histograms and density plots are useful for visualizing the distribution of a continuous variable
Code
ggplot(mpg, aes(hwy)) +
  geom_histogram() 

Code
ggplot(mpg, aes(hwy)) +
  geom_density()

ggplot2

Histograms and density plots

Code
# p_ prefixes avoid masking the base R function hist()
p_den<- ggplot(mpg, aes(displ, colour = drv)) + 
  geom_density(linewidth=0.8)
  
p_hist<- ggplot(mpg, aes(displ, fill = drv)) + 
  geom_histogram(binwidth = 0.5) + 
  facet_wrap(~drv, ncol = 1)

ggarrange(p_den, p_hist, ncol=2)

ggplot2

ggsave - save the graph as an image file

Code
ggsave(filename="mpg_displ.png",width=6, height=4)
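
When no plot is given, ggsave() saves the last plot that was displayed. Passing the plot explicitly avoids saving the wrong figure. This is a sketch, assuming ggplot2 and a working PNG graphics device are available; `p_displ` and the temporary file name are hypothetical.

```r
library(ggplot2)

# Keep the plot in an object so it can be named when saving
p_displ <- ggplot(mpg, aes(displ, hwy)) +
  geom_point()

# Write to a temporary folder; width and height are in inches by default
out_file <- file.path(tempdir(), "mpg_displ_demo.png")
ggsave(filename = out_file, plot = p_displ, width = 6, height = 4, dpi = 300)

file.exists(out_file)
```

The file type is taken from the extension, so changing ".png" to ".pdf" would produce a vector file instead.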

Final Example - toy imports to the US from 1996-2005

  • it is drawn from Scott (2021)
Code
library(tidyverse)

toy_imports <- read_csv("https://raw.githubusercontent.com/kwan-MSDA/Bootcamp_2024/main/dataset/toyimports.csv")

head(toy_imports)
# A tibble: 6 × 8
  partner  year partner_name       product product_name US_report_import pop2000
  <chr>   <dbl> <chr>                <dbl> <chr>                   <dbl>   <dbl>
1 ARE      1998 United Arab Emira…  950341 "Toys repre…             1.06  3.25e6
2 ARE      2000 United Arab Emira…  950349 "Toys repre…            12.0   3.25e6
3 ARE      2003 United Arab Emira…  950349 "Toys repre…             4.65  3.25e6
4 ARE      2005 United Arab Emira…  950320 "Reduced-si…            49.2   3.25e6
5 ARG      1996 Argentina           950341 "Toys repre…             0     3.69e7
6 ARG      1996 Argentina           950310 "Electric t…            10.8   3.69e7
# ℹ 1 more variable: region <dbl>
  • Task: make a graph showing total toy imports over time for the U.S.’s top 5 trading partners by total dollar value of toys imported

Final Example - toy imports to the US from 1996-2005

Code
country_total<- toy_imports %>% 
  group_by(partner_name) %>%
  summarize(total_import=sum(US_report_import)) %>%
  arrange(desc(total_import)) %>%
  head(5)

country_total
# A tibble: 5 × 2
  partner_name     total_import
  <chr>                   <dbl>
1 China               26842305.
2 Denmark              1034990.
3 Canada                572309.
4 Hong Kong, China      545186.
5 Switzerland           400969.
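
The arrange() plus head() pattern can also be written with slice_max(). The sketch below uses a small made-up table (`toy_demo`, a hypothetical stand-in for toy_imports with invented values) and assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Made-up stand-in for toy_imports, so the block runs without downloading data
toy_demo <- tibble(partner_name     = c("A", "A", "B", "C", "C", "D"),
                   US_report_import = c(10, 5, 3, 20, 1, 7))

top2 <- toy_demo %>%
  group_by(partner_name) %>%
  summarize(total_import = sum(US_report_import), .groups = "drop") %>%
  slice_max(total_import, n = 2)  # keeps the 2 largest totals, largest first

top2
```

Unlike head(), slice_max() keeps tied rows by default, so it can return more than n rows when totals are equal.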

Final Example - toy imports to the US from 1996-2005

Code
# reuse the top-5 names computed in country_total instead of retyping them
top5_partners <- country_total$partner_name

options(scipen = 999)

library(ggthemes)
library(scales)
library(plotly)

p <- toy_imports %>% 
  filter(partner_name %in% top5_partners) %>%
  group_by(year, partner_name) %>%
  summarize(total_import=sum(US_report_import), .groups="drop") %>% 
  ggplot(aes(year, total_import, color=partner_name)) +
  geom_line()+
  labs(title="Toy imports from the U.S.'s top-5 partners, 1996-2005",
       x="Year",
       y="Dollar value of imports (log scale)",
       color="Trading partner")+
  scale_x_continuous(breaks=1996:2005)+
  theme_economist()+ 
  scale_y_log10(breaks = trans_breaks("log10", function(x) 10^x),
              labels = trans_format("log10", math_format(10^.x)))

ggplotly(p)

Extra: Gapminder data

Code
library(gapminder)

data(gapminder)

gapminder %>% 
  group_by(year, continent) %>%
  summarize(median_lifeExp = median(lifeExp), .groups = "drop") %>%
  ggplot(aes(year, median_lifeExp, color=continent)) +
  geom_line()+
  labs(title="Life expectancy by continent and year",
       x="Year",
       y="Life expectancy")+
  theme_economist()

Code
ggplot(gapminder, aes(x = continent, y = lifeExp)) +
  geom_boxplot(outlier.colour = "hotpink") +
  geom_jitter(position = position_jitter(width = 0.1, height = 0), alpha = 1 / 4)

Extra: Gapminder data

this uses the BBC style (bbplot)

Code
# install.packages('devtools')
# devtools::install_github('bbc/bbplot')

library(ggpubr)

source("https://raw.githubusercontent.com/kwan-MSDA/R/main/bbc_style.R")

gapminder %>% 
  group_by(year, continent) %>%
  summarize(median_lifeExp = median(lifeExp)) %>%
  ggplot(aes(year, median_lifeExp, color=continent)) +
  geom_line()+
  labs(title="Life expectancy by continent and year",
       x="Year",
       y="Life expectancy")+
  bbc_style()

Extra: Gapminder data

Code
library("ggalt")
library("tidyr")
 
library(gapminder)

dumbbell_df <- gapminder %>%
  filter(year == 1967 | year == 2007) %>%
  select(country, year, lifeExp) %>%
  pivot_wider(names_from = year, values_from = lifeExp) %>%
  mutate(gap = `2007` - `1967`) %>%
  arrange(desc(gap)) %>%
  head(10)
 
#Make plot
ggplot(dumbbell_df, aes(x = `1967`, xend = `2007`, y = reorder(country, gap), group = country)) + 
  geom_dumbbell(colour = "#dddddd",
                size = 3,
                colour_x = "#FAAB18",
                colour_xend = "#1380A1") +
  bbc_style() + 
  labs(title="We're living longer",
       subtitle="Biggest life expectancy rise, 1967-2007")

Extra: Gapminder data

Code
library(hrbrthemes)
library(viridis)

gapminder %>% 
  filter(year==2007) %>%
  mutate(country=factor(country, levels=unique(country))) %>%
  arrange(desc(pop)) %>% 
  ggplot(aes(x=gdpPercap, y=lifeExp, size=pop, fill=continent)) +
  geom_point(alpha=0.6, shape=21, color="black")+
  scale_size(range=c(.1, 24), name="Population (M)")+
  scale_fill_viridis(discrete=TRUE, guide="none", option="A")+
  theme_ipsum()+
  theme(legend.position="none")+
  labs(title="Life expectancy by continent in 2007",
       x="GDP per capita",
       y="Life Expectancy")

Extra: Gapminder data

Code
library(gganimate)

gapminder %>% 
  ggplot(aes(x=gdpPercap, y=lifeExp, size=pop, fill=continent, frame=year)) +
  geom_point(alpha=0.6, shape=21, color="black")+
  scale_size(range=c(.1, 22), name="Population (M)")+
  scale_fill_viridis(discrete=TRUE, guide="none", option="A")+
  theme_ipsum()+
  theme(legend.position="none")+
  labs(title="Life expectancy by continent in {frame_time}",
       x="GDP per capita",
       y="Life Expectancy")+
  geom_text(data=gapminder %>%  filter(pop >1e+8), aes(label=country), size=5, nudge_x=0.1, nudge_y=0.1)+
  transition_time(year)+
  enter_fade()+
  exit_fade()

Code
anim_save("gapminder_gganimate.gif")

Extra: Gapminder data

source

Code
library(plotly)
library(hrbrthemes)
library(viridis)

g<- crosstalk::SharedData$new(gapminder %>% 
                              mutate(country=factor(country, levels=unique(country))) %>%
                              arrange(desc(pop)),
                              ~ continent)
gg<- g %>% 
  ggplot(aes(x=gdpPercap, y=lifeExp, fill=continent, frame=year)) +
  geom_point(aes(size=pop, alpha=0.6, ids=country))+
  scale_size(range=c(.1, 24), name="Population (M)")+
  scale_fill_viridis(discrete=TRUE, guide="none", option="A")+
  scale_alpha(range=c(0.6, 1), guide="none")+
  theme_ipsum()+
  # theme(legend.position="none")+
  labs(title="Life expectancy by continent between 1952-2007",
       x="GDP per capita",
       y="Life Expectancy")

ggplotly(gg, height = 500, width = 800)

References

Scott, James. 2021. “Data Science in R: A Gentle Introduction.” 2021. https://bookdown.org/jgscott/DSGI/.
Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10): 1–23.
———. 2016. ggplot2: Elegant Graphics for Data Analysis. Second edition. Use R! Switzerland: Springer.
———. 2019. Advanced R. Second edition. The R Series. Boca Raton, FL: CRC Press.
Wickham, Hadley, and Garrett Grolemund. 2023. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. Second edition. Beijing: O’Reilly.